Problem statement

Monitoring data (BEMS & DEMS):

  1. Energy (MWh)
  2. Temperature (°C)
  3. Flow (m3/h)
  4. Volume (m3)
  5. Power (kW)

Data loading

Data Wrangling

Time variable conversion

The datetime column is converted to the datetime.datetime type.
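A minimal sketch of this conversion with pandas follows. The frame `raw`, the column name `datetime` and the string format are assumptions made for illustration; the actual column layout of the BEMS / DEMS exports may differ.

```python
import pandas as pd

# Hypothetical raw frame; column names and timestamp format are assumptions.
raw = pd.DataFrame({
    "datetime": ["2021-05-14 09:00:00", "2021-05-14 09:15:00"],
    "value": [1.0, 2.0],
})

# Parse the strings into timestamps. An explicit format avoids
# ambiguous day / month parsing.
raw["datetime"] = pd.to_datetime(raw["datetime"], format="%Y-%m-%d %H:%M:%S")
```

pandas stores the result as `datetime64[ns]`; individual elements come back as `pd.Timestamp`, which is a subclass of `datetime.datetime`.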

Unfortunately we can see that not all dataframes contain all variables:

This is the crucial date at which the extra variables are added to the dataset. New variables appear on 14 May 2021 at 09:15:00. From then on the dataset is complete.

The new variables added are the following 29:

Missing data

There is no missing data initially, but missing values will be injected soon.

Time series dataset creation

First of all, a dictionary of dataframes is created. Each dataframe refers to one meter / variable and contains the columns datetime & value. We thus have as many time series datasets as there are meters, and these need to be concatenated to form a unified time series dataset of all variables (meters). An outer join is performed so that variables that do not take values at all timesteps are retained and filled with NaN.
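The outer join described here can be sketched as follows. The meter names `energy_a` and `temp_b` and their values are made up for illustration; only the datetime / value column layout comes from the text.

```python
import pandas as pd

# Two hypothetical meters; names and values are illustrative only.
frames = {
    "energy_a": pd.DataFrame({
        "datetime": pd.to_datetime(
            ["2021-01-01 00:00", "2021-01-01 00:15", "2021-01-01 00:30"]),
        "value": [10.00, 10.01, 10.03],
    }),
    "temp_b": pd.DataFrame({
        "datetime": pd.to_datetime(["2021-01-01 00:15", "2021-01-01 00:30"]),
        "value": [61.5, 62.0],
    }),
}

# Index each frame by time and name its value column after the meter.
series = [df.set_index("datetime")["value"].rename(name)
          for name, df in frames.items()]

# Column-wise concatenation with an outer join keeps every timestamp
# seen in any meter; meters without a reading there get NaN.
ts = pd.concat(series, axis=1, join="outer").sort_index()
```

Here `temp_b` has no reading at 00:00, so that cell is NaN while the `energy_a` reading is kept.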

As the time series do not all span the full range from 11/20 to 06/21, NaNs are used to fill the gaps. It is worth noting that the datetime column consists of 23232 datetimes, which is exactly what we expect: n_days = 30 (Nov) + 31 (Dec) + 31 (Jan) + 28 (Feb) + 31 (Mar) + 30 (Apr) + 31 (May) + 30 (June) = 242, and 4 quarters x 24 hours x 242 days = 23232 quarter-hourly timesteps.
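The count can be double-checked with a few lines of standard-library Python. This is only a sanity check of the arithmetic, assuming the eight full calendar months from November 2020 to June 2021.

```python
import calendar

# The eight calendar months of the study period (assumed to be complete).
months = [(2020, 11), (2020, 12), (2021, 1), (2021, 2),
          (2021, 3), (2021, 4), (2021, 5), (2021, 6)]

# monthrange returns (weekday of the first day, number of days in the month).
n_days = sum(calendar.monthrange(year, month)[1] for year, month in months)

# 4 quarter-hours per hour, 24 hours per day.
expected_timesteps = 4 * 24 * n_days
```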

Invalid data

It is observed that at 2021-03-19 10:15:00 all energy meters show erroneous measurements: they break the ascending pattern and repeat the last 3 values.

These values will be replaced first by NaN and then by intermediate values between the previous and next valid measurements (linear interpolation).
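A minimal sketch of this two-step repair on a toy cumulative energy series follows. The readings are invented, and a single bad timestamp is used for brevity; in practice a boolean mask over all affected rows would be passed instead.

```python
import numpy as np
import pandas as pd

# Toy cumulative energy readings (MWh); values are illustrative only.
idx = pd.date_range("2021-03-19 09:30", periods=6, freq="15min")
energy = pd.Series([100.00, 100.02, 100.04, 100.04, 100.08, 100.10], index=idx)

# Timestamp flagged as invalid (assumed known from visual inspection).
bad = pd.Timestamp("2021-03-19 10:15:00")

# Step 1: blank out the invalid reading.
fixed = energy.copy()
fixed.loc[bad] = np.nan

# Step 2: fill it linearly between the neighbouring valid readings.
fixed = fixed.interpolate(method="linear")
```

On an evenly spaced index, linear interpolation of a single gap returns the midpoint of its neighbours, here 100.06.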

Load variables creation

In this section new time series are created by differencing the energy time series, yielding mean power (load) series that are available for the whole study period and can be useful for forecasts. These series lack precision: the energy meters resolve only 1/100 MWh, i.e. 10 kWh per reading (40 kW of mean power over a 15-minute timestep), whereas the dedicated power meters offer a precision of 0.01 kW. However, for total loads (where precision is less important) they are expected to provide useful forecasts.
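The differencing step and its coarse resolution can be illustrated as follows. The readings are invented, and rounding the differences to 2 decimals is an added assumption to suppress floating-point noise.

```python
import pandas as pd

# Toy cumulative energy meter (MWh, 2 decimals); values are illustrative.
idx = pd.date_range("2021-01-01 00:00", periods=4, freq="15min")
energy_mwh = pd.Series([10.00, 10.01, 10.03, 10.03], index=idx)

# First difference = energy consumed during each 15-minute step (MWh).
load_mwh = energy_mwh.diff().round(2)

# Equivalent mean power: MWh -> kWh (x1000), then per hour (x4 quarters).
mean_power_kw = load_mwh * 1000 * 4
```

The first row is NaN because there is no previous reading. One meter increment of 0.01 MWh corresponds to 10 kWh per step, i.e. 40 kW of mean power at a 15-minute timestep, so the derived series can only take values in multiples of 40 kW.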

Conversion of mean power (kW) to energy (MWh)

To later compare differenced energies (average 15 min loads) with the respective measured powers, we need to convert power to 15 min loads. The conversion is kW to MWh for a timestep of 15 min, so we need to divide powers by 4 * 1000. This conversion creates the final version of the 15 min timestep dataset.
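The conversion can be written as a small helper. The function name `kw_to_mwh` and its `minutes` parameter are hypothetical and not taken from the notebook.

```python
import pandas as pd

def kw_to_mwh(power_kw, minutes=15):
    """Energy (MWh) delivered by a mean power (kW) over one timestep."""
    hours = minutes / 60           # 15 min -> 0.25 h
    return power_kw * hours / 1000  # kW * h = kWh, then kWh -> MWh

# Illustrative mean powers in kW.
power_kw = pd.Series([40.0, 80.0, 0.0])
load_mwh_15min = kw_to_mwh(power_kw, minutes=15)  # same as / (4 * 1000)
load_mwh_60min = kw_to_mwh(power_kw, minutes=60)  # same as / 1000
```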

Resampling

We create a new load-only dataset and change its timestep to 30 min and 60 min in order to bring the diff loads closer to the true loads. For cumulative measurements (loads, energies) the aggregation should be a sum. For the remaining measurements (flows, temperatures, powers) a mean aggregator is needed.
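A sketch of resampling with a per-column aggregator follows. The column names `load_mwh` and `temp_c` are placeholders, not names from the dataset.

```python
import pandas as pd

# Two hours of toy 15-minute data; names and values are illustrative.
idx = pd.date_range("2021-01-01 00:00", periods=8, freq="15min")
df = pd.DataFrame({
    "load_mwh": [0.01] * 8,                       # energy per step: sum
    "temp_c": [60, 61, 62, 63, 64, 65, 66, 67],   # state variable: mean
}, index=idx)

# Loads are added up over the new timestep; temperatures are averaged.
agg = {"load_mwh": "sum", "temp_c": "mean"}
df_30 = df.resample("30min").agg(agg)
df_60 = df.resample("60min").agg(agg)
```

Using a sum for loads preserves the total energy across timesteps, whereas averaging them would understate the energy per interval.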

Auxiliary calendar information dataset

In this section a calendar is created containing auxiliary info for all unique dates that appear in the core dataset. Holidays are extracted for the province of Valladolid using the holiday-es python package.

Only 7 holidays fall within our study period, yet we observe 672 rows with the holiday variable set to True. This is because every day has 24 x 4 quarter-hourly timesteps, and 24 x 4 x 7 = 672.
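The row count can be reproduced with a toy calendar. The seven dates below are placeholders and not the actual Valladolid holidays, and the holiday package is deliberately not used here, so this only illustrates the flagging logic.

```python
import pandas as pd

# Toy 15-minute index; the period is arbitrary.
idx = pd.date_range("2021-01-01 00:00", "2021-02-28 23:45", freq="15min")

# Seven placeholder "holiday" dates (NOT the real Valladolid holidays).
placeholder_holidays = pd.date_range("2021-01-04", periods=7, freq="7D")

# Calendar features per timestep.
cal = pd.DataFrame({
    "weekday": idx.dayofweek,   # Monday=0 ... Sunday=6 (pandas convention)
    "month": idx.month,
    "year": idx.year,
    "hour": idx.hour,
}, index=idx)

# A timestep is a holiday if its calendar date is in the holiday list.
cal["holiday"] = idx.normalize().isin(placeholder_holidays)

n_holiday_rows = int(cal["holiday"].sum())
```

Each flagged date contributes 96 quarter-hourly rows, so 7 dates give 672 rows.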

At this point mixed datasets (time series + calendar features) are also created.

EDA

Calendar EDA

We validate again that no timesteps are missing from the dataset by ensuring that no time difference between successive timesteps is larger than 15 minutes.
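This check can be sketched as follows. The helper name `has_no_gaps` is hypothetical.

```python
import pandas as pd

def has_no_gaps(index, step=pd.Timedelta(minutes=15)):
    """True if no two successive timestamps are more than `step` apart."""
    deltas = index.to_series().diff().dropna()
    return bool((deltas <= step).all())

# A complete day of quarter-hourly timestamps passes the check.
full = pd.date_range("2021-01-01", periods=96, freq="15min")
ok = has_no_gaps(full)

# Dropping one timestamp creates a 30-minute jump, which fails it.
gappy = full.delete(10)
broken = has_no_gaps(gappy)
```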

Observations from calendar EDA

  1. We have data for all weekdays, though not with equal counts. However, the data start on a Saturday (6) and end on a Tuesday (2), so this is expected.
  2. The amount of data per month fluctuates with the number of days in each month (28, 30, 31).
  3. As we have 6 months of data from 2021 and 2 from 2020, the year distribution looks OK.
  4. The same holds for the 7 holidays in total that are observed in our dataset.

Time series dataset EDA

Distribution plots (histogram + density + cumulative density) for all variables:

Original time series plots (15 min timestep)

Gas meters

Normally we would observe purely increasing trends here, but the counters were probably reset in late November 2020. The unit is cubic meters (m3).

Energy meters

Machine Temperatures

Machine temperatures are recorded only from mid-May onwards. There is not much to observe for now, apart from clear daily patterns, which are visible in the graph because the sample covers a short period.

True loads

True loads are derived from average powers by multiplying by the timestep duration (kW to kWh).

Flow

Volumes

Demands (Demandas)

Load time series (differentiated energy time series) plots - 15 minute dataset

We can observe that the total active load is much larger than the other variables, which suggests that it aggregates (sums up) several of the other variables.

Comparison with power measurements for the 3 datasets

15 minute dataset

To compare differenced energies (average loads per timestep) with the respective measured powers, we need to convert power to energy per timestep. The conversion is kW to MWh: for a timestep of 1 hour we divide powers by 1000, and for a timestep of 15 minutes by 4 * 1000.

There is a fairly good overlap between corresponding true / diff time series.

30 minute dataset

Hourly dataset

We can observe that discrepancies between true and diff loads are somewhat mitigated at larger timesteps. This motivates processing the 1 hour timestep dataset in the next steps.

Calendar based exploration for the total active power synthetic variable

Observations:

  1. Load peaks are observed mainly during the afternoon, until 21:00.
  2. Wednesdays, along with Fridays and Saturdays, exhibit the highest loads.
  3. The total loads are much higher during the winter months. June has the lowest mean value (July to October are not valid, as there is no data for these months).
  4. Holidays, weekends and working days do not have much effect on the mean values of the total load.
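Calendar-based summaries of this kind can be obtained by grouping a load series on components of its datetime index. The sketch below uses a synthetic series, so the numbers do not reflect the dataset; the variable names are placeholders.

```python
import pandas as pd

# Synthetic hourly series covering Monday 4 and Tuesday 5 January 2021.
idx = pd.date_range("2021-01-04 00:00", periods=48, freq="60min")
load = pd.Series(range(48), index=idx, dtype="float64")

# Mean load per hour of day and per weekday (Monday=0 in pandas).
by_hour = load.groupby(load.index.hour).mean()
by_weekday = load.groupby(load.index.dayofweek).mean()
```

The same pattern works with `load.index.month`, or with a boolean holiday column, as the grouping key.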

Correlograms

Correlograms amongst all variables
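As a rough sketch, the matrix behind such a correlogram can be computed with `DataFrame.corr`. The columns below are synthetic, and the Pearson method is an assumption, since the text does not state which coefficient was used.

```python
import pandas as pd

# Synthetic variables: b rises with a, c falls with a.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

# Pairwise Pearson correlation between all columns.
corr = df.corr(method="pearson")
```

The resulting square matrix can then be passed to a heatmap plotting routine.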

Observations

Inter variable scatterplots

Scatterplots, along with their regression lines, are also shown for the most interesting pairs as described above.

Extra observations:

  1. Plots concerning the newly created diff load variables clearly exhibit a rounding effect, which stems from computing these variables from low-resolution energy measurements (measured in MWh with 2 decimal places). This effect is evident from the points being distributed in a discrete manner.
  2. Linear relationships with slope 1 are observed amongst true and diff loads as expected.
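Both observations can be reproduced on synthetic data: rounding a continuous load to 2 decimals yields discrete levels, while a least-squares line through the (true, rounded) pairs keeps a slope close to 1. The values below are invented and only mimic the effect.

```python
import numpy as np

# Synthetic "true" loads in MWh, on a fine continuous scale.
true_load = np.linspace(0.10, 1.00, 200)

# "Diff" loads as seen through a meter with 0.01 MWh resolution.
diff_load = np.round(true_load, 2)

# Least-squares regression line: diff_load ~ slope * true_load + intercept.
slope, intercept = np.polyfit(true_load, diff_load, 1)

# Number of distinct levels the rounded variable can take.
n_levels = len(np.unique(diff_load))
```

The 200 distinct true values collapse onto at most 91 rounded levels, which is what produces the banded look in the scatterplots.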

Final comments

In the created datasets:

ACF / PACF

Example decomposition to capture daily seasonality